Text Chunking based on a Generalization of Winnow
نویسندگان
چکیده
This paper describes a text chunking system based on a generalization of the Winnow algorithm. We propose a general statistical model for text chunking which we then convert into a classification problem. We argue that the Winnow family of algorithms is particularly suitable for solving classification problems arising from NLP applications, due to their robustness to irrelevant features. However in theory, Winnow may not converge for linearly non-separable data. To remedy this problem, we employ a generalization of the original Winnow method. An additional advantage of the new algorithm is that it provides reliable confidence estimates for its classification predictions. This property is required in our statistical modeling approach. We show that our system achieves state of the art performance in text chunking with less computational cost then previous systems.
منابع مشابه
Text Chunking using Regularized Winnow
Many machine learning methods have recently been applied to natural language processing tasks. Among them, the Winnow algorithm has been argued to be particularly suitable for NLP problems, due to its robustness to irrelevant features. However in theory, Winnow may not converge for nonseparable data. To remedy this problem, a modification called regularized Winnow has been proposed. In this pap...
متن کاملA SNoW Based Supertagger with Application to NP Chunking
Supertagging is the tagging process of assigning the correct elementary tree of LTAG, or the correct supertag, to each word of an input sentence1 . In this paper we propose to use supertags to expose syntactic dependencies which are unavailable with POS tags. We first propose a novel method of applying Sparse Network of Winnow (SNoW) to sequential models. Then we use it to construct a supertagg...
متن کاملRapid Development of Nlp Modules with Memory-based Learning
The need for software modules performing natural language processing (NLP) tasks is growing. These modules should perform efficiently and accurately, while at the same time rapid development is often mandatory. Recent work has indicated that machine learning techniques in general, and memory-based learning (MBL) in particular, offer the tools to meet both ends. We present examples of modules tr...
متن کاملتعیین مرز و نوع عبارات نحوی در متون فارسی
Text tokenization is the process of tokenizing text to meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactical phrases named as chunking is an important preprocessing needed in many applications such as machine translation information retrieval, text to speech, etc. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...
متن کاملText Classification in Information Retrieval using Winnow
Text classification in Information Retrieval can be done by using a linear classifier. Linear learning algorithms classify documents by learning a linear separator based on the document features. Littlestone's Winnow is such as linear learning algorithm. I have described three learning algorithms, based on Littlestone's Winnow, which can be applied to perform this task. Modifications of the alg...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Journal of Machine Learning Research
دوره 2 شماره
صفحات -
تاریخ انتشار 2002